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Abstract — Covarion models of character evolution describe 
inhomogeneities in substitution processes through time. In phy- 
logenetics, such models are used to describe changing functional 
constraints or selection regimes during the evolution of biological 
sequences. In this work the identifiability of such models for 
generic parameters on a known phylogenetic tree is established, 
provided the number of covarion classes does not exceed the size 
of the observable state space. 'Generic parameters' as used here 
means all parameters except possibly those in a set of measure 
zero within the parameter space. Combined with earlier results, 
this implies both the tree and generic numerical parameters are 
identifiable if the number of classes is strictly smaller than the 
number of observable states. 

Index Terms — phylogenetics, Markov processes on trees, co- 
varion models, statistical consistency 

I. Introduction 

Phylogenetic inference is now generally performed in a 
statistical framework, using probabilistic models of the evo- 
lution of biological sequences, such as DNA or proteins. 
To rigorously establish the validity of such an approach, a 
fundamental question that must be addressed is whether the 
models in use are identifiable: From the theoretical distribution 
predicted by the model, is it possible to uniquely determine 
all parameters? Parameters for simple models include the 
topology of the evolutionary tree, edge lengths on the tree, and 
rates of various types of substitution, though more complicated 
models have additional parameters as well. If a model is non- 
identifiable, one cannot show that performing inference with 
it will be statistically consistent. Informally, even with large 
amounts of data produced by an evolutionary process that was 
accurately described by the model, we might make erroneous 
inferences if we use a non-identifiable model. 

Identifiability for the most basic phylogenetic models, such 
as the Jukes-Cantor, Kimura, and all other time-reversible 
models, follows from Chang's work on the general Markov 
model [5]. However, for models with rate variation across 
sites, where the distribution of rates is not fully known, only 
recently have the first positive results been obtained [2], [3], 
[1]. Despite its widespread use in data analysis, identifiability 
of the GTR+r+I model has yet to be addressed rigorously. 
(Unfortunately the proof of identifiability given in [17] has 
fundamental gaps, as explained in the appendix of [1].) 

The covarion model, introduced in its basic mathematical 
form by Tuffley and Steel [19], incorporates rate variation 
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within lineages rather than across sites. Extensions of the basic 
version of the model have appeared in a variety of analyses 
of experimental data, with authors referring to the model 
using terminology such as 'covarion' [12], 'covarion-like' [7], 
[20], 'site-specific rate variation' [7], [10], 'Markov-modulated 
Markov process' [8], [9], or 'temporal hidden Markov models' 
[21]. We use the name 'covarion' in this paper for simplicity, 
although we acknowledge the model does not capture the 
full complexity of the process originally proposed by Fitch 
and Markowitz [6]. Informally, the covarion model allows 
several classes (e.g., invariable, slow, and fast), with characters 
evolving so they not only change between observable states, 
but also between classes. Though the class is never observed, 
it affects the evolutionary process over time. The model thus 
attempts to capture the fact that substitution rates may speed up 
or slow down at different sites in a sequence at different times 
in their descent. Changing functional constraints or selection 
regimes are possible sources of such a process. 

Identifiability of even the tree parameter under the covarion 
model was not established with its introduction in [19], despite 
strong efforts. In [2], the authors established that for generic 
choices of covarion parameters tree topologies are indeed 
identifiable, provided the number of covarion classes is less 
than the number of observable states. Thus for nucleotide 
models of DNA there can be 3 classes, though for amino acid 
models of proteins one can allow 19 classes, and for codon 
models of DNA up to 60 classes. 'Generic' here means that 
there could be some parameter choices for which identifiability 
fails, though they will be rare (of Lebesgue measure zero). 
In fact, if parameters are chosen randomly, with any natural 
notion of random, one can be sure the tree topology is 
identifiable. 

Since the notion of generic identifiability is perhaps not 
widely known, and will play a key role in this work as well, 
we elaborate on its meaning. For statistical models in general, 
it is most desirable to establish identifiability over the full 
parameter space. However, such a strong claim may not hold, 
so that the best possible result is to establish identifiability over 
most of the parameter space, and completely characterize all 
those parameter choices for which identifiability fails. Generic 
identifiability results are a little weaker than this, in that while 
identifiability is established over most of the parameter space, 
they allow for ignorance about identifiability on a small subset 
of the parameter space. This exceptional subset of parameter 
space contains all parameters for which identifiability fails, 
but may also contain some parameters that are identifiable. 
Complex statistical models can be quite difficult to analyze, 
so that generic identifiability is sometimes the strongest known 
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result. For instance, though hidden Markov models are widely 
used in bioinformatics and other fields, and generic iden- 
tifiability was proved for HHMs in [16], we know of no 
improvements on that work in the nearly 40 years since it 
appeared. Phylogenetic models are similar to HMMs in that 
they posit unobserved variables, at the internal nodes of a 
tree, but typically have more complex parameterizations than 
HMMs. Thus we consider their analysis to be even more 
challenging. 

The question of identifiability of numerical parameters for 
the covarion model was left open by [2]. In this article, we 
assume the tree topology is known, and establish identifiability 
of the numerical parameters of several variants of the covarion 
model for generic parameter choices, provided the number of 
covarion classes is strictly less than the number of observable 
states. For certain versions of a covarion model, this can be 
strengthened to allow one more class, so that the number of 
classes and observable states may be the same. 

We consider three variants of the covarion model, which 
extend the Tuffley-Steel model, and have previously appeared 
in works of others, though without our formal terminology: 
The scaled covarion model, sCov, assumes all classes undergo 
substitutions according to a common process but at rescaled 
rates. The equal stationary distribution covarion model, eCov, 
generalizes this to allow in-class substitution processes to vary 
more across classes, provided they have identical stationary 
distributions and class change rates are independent of the 
base. Finally, in the general covarion model, Cov, each class 
may undergo substitutions quite differently as long as the 
entire process is time reversible. Cov is the model described 
in [21], eCov is developed in [9], and sCov is used in [10]. 

Note these models are nested, 

sCov c eCov c Cov, 

though each submodel is non-generic within its supermodels. 
Because identifiability is established here only for generic 
parameters, it is necessary to state and prove the generic 
identifiability of all three covarion models to encompass the 
range of models used in practice. 

In Section [TT] we formally present these models, and in 
Section [HI] we state our results precisely. That section also 
provides an overview of the proof. For those whose primary 
interest is understanding the result, and who do not wish 
to delve into the full mathematical arguments behind it, we 
suggest that reading through Section [Til] may suffice. The 
remainder of the paper provides the rather detailed arguments 
that are essential to rigorously establishing identifiability. 

We also note that many practitioners have conducted data 
analysis with models combining covarion features with across- 
site rate variation, such as that modeled by a discrete T distri- 
bution. While the identifiability of such models has not been 
established rigorously as of yet, we view the main theorems 
of this paper as providing a first step toward understanding of 
these more complex models. 

This work was influenced by many useful discussions 
concerning covarion models that we had with participants 
of the Isaac Newton Institute's Programme in Phylogenetics. 



Simon Whelan deserves particular thanks for explaining his 
forthcoming work [21]. 
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II. The Parameterization of the Covarion Models 

For the purpose of orientation, we briefly recall a simpler 
phylogenetic model, the K-state general time reversible (GTR) 
model. The basic state change process is specified by a 
k x k rate matrix Q, whose off-diagonal zj-entry gives an 
instantaneous rate (> 0) at which a character in state i enters 
state j. Each row of Q must add to 0. As a consequence, Q has 
a unique left eigenvector tt with eigenvalue 0, the stationary 
vector for Q. Time reversibility is mathematically formulated 
as the assumption that diag(7r)Q is symmetric. Character 
change along a rooted metric tree T is then modeled as 
follows: The entries of tv give the probability that a character 
is in the various states at the root of the tree. Along each 
edge e of T, directed away from the root, the conditional 
probabilities of state changes are given by the Markov matrix 
M e — exp(Qt e ), where t e > is the edge length. From 
this information one can compute the probability of any 
specification of states at the leaves of the tree. Due to the 
time reversibility assumption, the location of the root within 
the tree actually has no effect on this probability distribution. 
Thus the parameters of the model are the topology of the 
unrooted tree T, the collection of edge lengths {t e }, and the 
rate matrix Q. 

To present the covarion models, we first focus on the process 
of state change. It will be convenient to adopt terminology 
most appropriate to nucleotide sequences. In particular, in 
discussing covarion models we limit our use of the word 'state' 
which is commonly used for all Markov models, because the 
number of states at internal nodes of a tree differs from that at 
leaves, even though there is a relationship between them. We 
instead refer to observable states as 'bases,' and to rate classes 
as 'classes.' Thus at a leaf a state is simply a base, while at an 
internal node a state is a pair of a class and a base. We caution 
the reader that this usage of 'base' is not standard in biology, as 
it encompasses the 4 bases in nucleotide sequences, as well as 
the 20 amino acids of protein sequences, and the 61 codons in 
a model of codon substitution. Also, while it is often natural to 
think of 'classes' as being associated to rate scalings, this may 
be misleading, as several of the models we formalize allow for 
more generality. We use [k] = {1, 2, . . . , «} to denote the set 
of bases and [c] = {1,2, ... ,6} to denote the set of classes. 

To refer to entries of vectors and matrices of size en, it will 
be convenient to index entries using interchangeably the set 
[ck], and the set [c] x [k] with lexicographic order. Thus the 
index which should be interpreted as the 'class i, base 

j' index, is equivalent to (i — l)c + j. Entries in a ck x ck 
matrix, then, can be referred to by an ordered pair of indices, 
each of which is an ordered pair in [c] x [k] . 

Let c, K be positive integers. The most general c-class, 
K-base covarion model, introduced by Whelan in [21], is 
specified in the following way: 
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(1) For each i e [c], a base-change process for class i 
is described by a rate-matrix Qi of size k x k. We 
assume all Qi are distinct, so that no two classes undergo 
substitutions at the same rates. For c — 1 values of i we 
require that the off-diagonal entries of Qi are strictly 
positive so that all substitutions are possible, and the 
rows sum to 0. For the remaining Qi we only require that 
all off-diagonal entries be non-negative and that rows 
sum to 0. In particular, we allow Qi for at most one i to 
be the 0-matrix, in order to model an invariable class. 

(2) For each ordered pair of classes i\ ^ i 2 , a diagonal 
matrix Si 1 i 2 of size kxk describes switching rates from 
class i\ to class i 2 . The entries of Si 1 i 2 are non-negative. 
The requirement that Si x i 2 be diagonal will imply that 
instantaneous base switches do not occur simultaneously 
with class switches. 

(3) Let R be the ck x ck matrix which, when viewed in 
c x c block form, has as its off-diagonal i^^-block 
Si x i 2 and as its ith diagonal block Qi — J2i 2 ^u 2 - Note 
each row of R sums to 0. We require that R describe a 
time-reversible process; that is, for some vector fi with 
positive entries summing to 1 the matrix 

diag(/x)i? 

is symmetric. 

We may rescale R, or equivalently all entries of the Qi and 
S ili2 , so that 

trace(diag(/x)i?) = —1. 

Requiring this normalization avoids a trivial non-identifiability 
issue in which rescaling of edge lengths would have the 
same effect as rescaling R. It also imposes a scale on edge 
lengths so that the average instantaneous rate of (base,class) 
changes under the Markov process is 1 per unit of edge 
length. We will assume throughout the rest of this paper that 
this normalization has been made. Consequently, if two such 
matrices are multiples of one another, we may conclude they 
are equal. 

Any matrix R with these properties will be called a covarion 
rate matrix for the general covarion model, Cov(c, k), with c 
classes and k bases. 

We may write 

A« = (ciTl, (T 2 TT 2 , • ■ • , ^c^c) 

where the 7Tj e M K and er = (a 1} . . . , a c ) e R c are 
vectors of positive entries summing to 1. Then the symmetry 
of diag(/x)i? implies the symmetry of diag(7Ti)Qi for each 
i. Thus our assumptions ensure the Qi each define time- 
reversible processes. Additionally we find 

<r n diag(7r 4l )5 nt2 = a l2 diag(7T i2 )5 i2il . (1) 

These conditions are equivalent to the time-reversibility of R. 

A specialization of Cov(c, k) described in [9] assumes 
further that 

(4) The base substitution processes described by the Qi have 
equal stationary distributions, 71^ = n. 

(5) The switching matrices Si 1 i 2 are scalar, so Si 1 i 2 = 



s iii 2 Im where I K is the kxk identity matrix. 
We refer to this as the equal stationary distribution covarion 
model, denoted by eCov(c, k). 

The model eCov(c, k) can also be conveniently described 
in tensor notation. For any vectors or matrices A = {ai x i 2 ) 
and B = (bj 1 j 2 ), let A ® B denote the tensor, or Kronecker, 
product. Using ordered-pair indices as above, we order rows 
and columns of A<g>B so the (i 2 ,j 2 ) entry is a ili2 bj 1 j 2 . 

With the class switching process for eCov specified by a 
c x c rate matrix S with off-diagonal entries s ili2 , and rows 
summing to 0, then 

R = diag(Qi, Q 2 , • . . , Q c ) + S <g> I K , 

= <T (g) 7T. 

The symmetry of diag(/x)i? is equivalent to the symmetry of 
each diag(7r)<5i and of diag(er)5. Thus the class switching 
process described by S is time-reversible as well. 

A further specialization from eCov yields the scaled covar- 
ion model, sCov(c, k), which assumes 

(6) For some rate matrix Q and distinct non-negative 
n, r 2 , ...,r c ,Qi = TiQ. 
For this submodel, the full covarion process has rate matrix 

R = dia,g(r 1 ,r 2 ,...,r c )®Q + S®I K . (2) 

Example!: sCov(2,4) is just a generalization of the 
Tuffley-Steel covarion model of nucleotide substitution [19]. 
For any s\, S2 > 0, let 

S=h Sl Sl ), ,, = (0^) 

\S2 -82)' \S!+S 2 S 1 +S 2 J 

Then S defines a time-reversible switching process with sta- 
tionary vector a. For any Q, tv of a 4-base GTR model, taking 
1 = n > r 2 we obtain a rate matrix with block structure 

x {Q-siI si 1 \ 
V s ^ r 2 Q-82l)' 

while 

/U=(<7l7Tl, CT17T2, <Tl7r 3 , CTl7r 4 , a 2 TTl, a 2 TT 2 , (T27T 3 , (J^i). 

If T2 = 0, then an invariable class is included, and this is 
exactly the Tuffley-Steel model. 

Example 2: If c > 3, the requirement for eCov(c, k) that 
the class switching process described by S be time-reversible 
implies stronger relationships among its entries than merely 
requiring rows sum to 0. If 

(-(S12 + S13) S12 S13 \ 

S21 ~(S21 + S23) S 23 

S31 «32 -(S3I+S32)/ 

and cr are such that diag(cr)S' is symmetric, then one can 
show (most easily by using symbolic algebra software, such 
as Maple or Singular) that 

S12S23S31 - S13S21S32 = 0, 

and 

C = ; ; (S21S32, Si 2 S 32 , Si 2 S23)- 

S21S32 + S12S32 + S 12 S 23 
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Let Q\,Q2,Q^ denote K-base GTR rate matrices with a 
common stationary vector tt. Then, up to a scaling factor, the 
matrix 

Qi - (si2 + si 3 )I si 2 7 si 3 I 

S2ll Q2 ~ (S21 + S 2 3)I S23I 

S3ll S 32 I Q3 - (S31 + S32)/ 

is a rate matrix for eCov(3, k) with stationary vector 

fJ, = (<7i7T <7 2 7T cr 3 7r). 

Such models are presented in [9]. 

Example 3: Let Qi,Q 2 denote K-base GTR rate matrices, 
with stationary vectors 7Ti, 7r 2 . Let er = (a%, cr 2 ) be any vector 
of positive entries summing to 1, and s = (si, s 2 , . . . , s K ) any 
vector of positive numbers. Then defining 

S12 = 02 diag(7r 2 ) diag(s), 
S21 = a\ diag(7Ti) diag(s), 

ensures that equation (HJ is satisfied. For suitable A, the matrix 

^ (Qi — S12 S12 \ 
V ^21 Q2 — S21 J 

is thus a rate matrix for the model Cov(2, k), and of the type 
described in [21]. 

To specify any of the covarion models Cov(c, n), 
eCov(c, k), or sCov(c, k) on a topological tree T, in addition 
to R we must specify edge lengths {t e }. These determine 
Markov matrices M e for each edge e of the tree as follows: 
For every internal edge e of the tree, M e = exp(Rt e ) is cky.ch 
and describes (class, base)-substitutions over the edge. Letting 
l c = (l 1 ... l) E R c be a row vector, and I K the n x k 
identity, set 

J=1 T C ®I K = (I K I K ... I K ) T . 

Then on every pendant edge e of the tree, M e = exp(Rt e )J 
is ck x k. Notice that J serves to hide class information, by 
summing over it, so that only bases may be observed. 

Because the process defined by R is reversible, we may 
arbitrarily choose any internal vertex of the tree as the root, 
and using fi as a root distribution compute the joint distribution 
of bases at the leaves of the tree in the usual way for 
Markovian phylogenetic models on trees. For an n-leaf tree, 
this distribution is naturally thought of as an n-dimensional 
K x k x • • ■ x k array. 

Let P = P Cg) 7 K , where P is a c x c permutation matrix. 
Then replacing R by P T RP simply permutes the classes. As 
no information on classes is observed, it is easy to see this 
has no effect on the joint distribution of bases arising from a 
covarion model. Thus we must account for this trivial source 
of non-identifiability. For sCov(c, k) this could be done by 
requiring the r, be enumerated in descending order. However, 
for Cov(c, k) and eCov(c, k) there need not be any natural 
ordering of the Qi. To treat all these models uniformly, we 
will seek identifiability only up to permutation of classes. 

Note that as formulated above, the covarion models gener- 
alize mixture models on a single tree with a finite number of 



classes. Indeed, one need only choose the switching matrix S 
for sCov or eCov to be the zero matrix, or set all Si ± i 2 = 
for Cov, to describe across-site rate variation. However, such 
choices are non-generic — of Lebesgue measure zero within 
the covarion models. Since our main result allows for non- 
generic exceptions to identifiability, we caution that it does 
not rigorously imply anything about across-site rate variation 
models, though it is perhaps suggestive. 

At one point in our arguments we will in fact need an 
assumption that rules out consideration of across-site rate 
variation models. In Lemma [12] we require that the switching 
process for Cov(c, k) is irreducible in the following sense: 
Say class i communicates to class i' when all diagonal entries 
of Su> are positive. Then class irreducibility of R will mean 
that for each pair of classes i ^ i' there is a chain of classes 
i = io, i\i iii ■ ■ ■ 1 in — i' with *fc communicating to ik+i- 
For the models eCov and sCov, this definition is equivalent 
to the usual definition of irreducibility, [11], for the Markov 
process described by the switching matrix S. Moreover, class 
irreducibility of R, together with the assumption that all entries 
of some Qi are non-zero implies irreducibility of R in the 
usual sense. 

Note that class irreducibility holds for generic choices of 
covarion parameters for all three covarion models, as generi- 
cally all diagonal entries of all Sw are non-zero. Therefore, 
despite its important role in establishing the results, we do 
not refer to irreducibility explicitly in statements of theorems 
which only make claims for generic parameter choices. 

III. Statement of Theorems and Overview 
We establish the following: 

Theorem 1: Consider the models Cov(c, k), eCov(c, k), 
and sCov(c, k) on an n-leaf binary tree, n > 7. If the tree 
topology is known, then for generic choices of parameters all 
numerical parameters are identifiable, up to permutation of 
classes, provided c < k for sCov and eCov, and provided 
c < k for Cov. 

Combined with earlier work in [2], this shows: 

Corollary 2: Consider the models Cov(c, k), eCov(c, n), 
and sCov(c, k) on an rt-leaf binary tree, n > 7. Then 
for generic choices of parameters, the tree topology and all 
numerical parameters are identifiable, up to permutation of 
classes, provided c < K. 

In outline, the proof of the theorem is as follows: Section 
HVl addresses basic properties of eigenvectors and eigenvalues 
of a covarion rate matrix, and discusses the form of joint 
distributions from covarion models on 2-leaf trees. This section 
provides preliminary results needed for the main arguments, 
which span the remainder of this article. 

To establish identifiability of model parameters on a par- 
ticular tree, our argument will require that there be a 6-leaf 
subtree with the particular topology shown in Figure Q] It is 
easy to see that any tree with at least 7 leaves contains such 
a 6-leaf subtree. (For simplicity, we chose to state Theorem Q] 
and its corollary for trees of 7 or more taxa, even though they 
also hold for this 6-leaf tree.) 
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Fig. 1 . The 6-leaf tree on which arguments will be based, with edges and 
internal nodes p, p'. 



In Section [V] the main thread of the proof begins. We use 
algebraic arguments built on a theorem of J. Rruskal [15] 
to determine the covarion Markov matrix M = exp(Rtg) 
describing the total substitution process over the central edge 
eg, of length tg, in the tree of Figure Q] up to permutation 
of the rows and columns. This part of our argument is not 
very specific to the covarion model, but rather applies to more 
general models provided the Markov matrices involved satisfy 
some technical algebraic conditions. We therefore must show 
that Markov matrices arising from the covarion model, as 
exponentials of a covarion rate matrix, satisfy these technical 
conditions, at least for generic parameter choices. Though 
this fact is completely plausible, establishing it rigorously 
requires rather detailed work, which is completed in Section 
IVII This part of our argument is the reason Theorem[T]refers to 
identifiability of 'generic' parameters and not all parameters, 
as well as the reason we require c < k. 

Once the Markov matrix on the central edge of the tree is 
identified up to row and column permutations, to determine the 
covarion rate matrix we must determine the correct row and 
column orderings, and take a matrix logarithm. We are able 
to show there is a unique ordering of rows and columns that 
produces a covarion rate matrix in part by taking advantage 
of the pattern of zeros that must appear in such a rate matrix. 
Other facts about rate matrices, such as the non-positivity of 
eigenvalues, also play a role. We obtain an essential piece of 
information on the ordering from the known ordering of bases 
at the leaves of the tree. All this is the content of Section [VTIl 

Finally, once we have determined the covarion rate matrix 
from this central edge, we use it in Section IVIIII to determine 
the sum of edge lengths between any two leaves in the tree. 
By standard arguments, we may then determine the lengths of 
all individual edges in the tree, so all parameters have been 
identified. 

Note that the later steps of our arguments are constructive, in 
that one could apply them to a specific probability distribution 
to explicitly recover the parameters producing it. However, 
Kruskal's theorem is not constructive; it guarantees a unique 
set of parameters but does not indicate a procedure for 
recovering them. A constructive version of Kruskal's theorem 
would give an algorithm for the decomposition of three- 
dimensional tensors into minimal sums of rank 1 tensors. This 
is an interesting but challenging open problem, which would 
have applications in several other areas of applied mathematics 
as well. However, the particular case of Kruskal's theorem we 
use can also be established by a longer argument, which we 
omit, along the lines of the identifiability result in [5]. Using 
that approach one obtains an explicit parameter identification 
procedure that depends on the calculation of eigenvectors for 



en x ck matrices. 

IV. DlAGONALIZING COVARION RATE MATRICES 

We summarize a few basic facts concerning the eigenvectors 
and eigenvalues of a covarion rate matrix R, under the 
hypotheses of the Cov(c, k) model. 

If R is a rate matrix for Cov(c, k) then it is time- 
reversible by assumption. Thus diag(/x)i? is symmetric, and 
diag(/x) 1 / 2 i?diag(/x) -1 / 2 is as well. Therefore 

diag(At) 1 / 2 i?diag(/x)- 1 / 2 = C T BC 

for some orthogonal C and real diagonal B. Letting U = 
Cdiag(zi) 1 ' 2 , we have 

R = U^BU, U- 1 = diagt/xT 1 ?/ 11 . 

If R is class irreducible, then it is irreducible. Thus one of 
its eigenvalues is and the others are strictly negative [11]. 
We may thus assume B = diag(/3i, /?2, ■ ■ ■ , (3 C k), where = 
01 > 02 > • • • > (3ck for generic R. 

Note that for the model sCov(c, n), much more can be 
said about this diagonalization. In [8], it is shown that the 
eigenvectors and eigenvalues for a scaled covarion rate matrix 
R are related to those of Q and certain modifications of S 
through a tensor decomposition. 

We now investigate the implications of the diagonalization 
of covarion rate matrices for 2-taxon probability distributions 
arising from the model. This will be useful for identifying 
edge lengths in Section IVIIII 

Suppose R = U~ 1 BU is the diagonalization described 
above. A 2-taxon distribution, arising from edge length i, is 
described by a k x k matrix 

N = J T diag(/x) exp(itt) J 

= J T diag(/x)f7- 1 exp( J B<)C/J 
= J T U T exp{Bt)UJ 
= (UJ) T exp{Bt)(UJ). 

We formalize this observation with the following lemma. 

Lemma 3: Let R be a covarion rate matrix for Cov(c, n). 
Then R determines a matrix B — diag(/3i, . . . , j3 CK ) with = 
01 > 02 > • • • > f3 C K, and a rank k matrix K of size ck x k 
such that the probability distribution arising from the covarion 
model with rate matrix R on a one-edge tree of length t is 

N = K T exp{Bt)K. 
Proof: It only remains to justify that the rank of K — U J 
is k. However, since U is non-singular, rank K = rank J = k. 



V. Identifying a Markov matrix on the central 
edge 

The basic identifiability result on which we build our later 
arguments is a theorem of J. Kruskal [15]. (See also [14], [13] 
for more expository presentations.) 
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For i = 1, 2, 3, let Ni be a matrix of size r x «;,-, with 
the jth row of iVj. Let [ATi, iV^j -/V 3 ] denote the m x k 2 x k 3 
tensor defined by 



[JVi,JV 3 ,JV 3 ] 



Thus the (ki,k 2 ,k 3 ) entry of [iVi, iV 2 , JV 3 ] is 
nj(fci)n|(fc 2 )nj- (fc 3 ), and this 'matrix triple product' 
can be viewed as a generalization of the product of two 
matrices (with one matrix transposed). 

Note that simultaneously permuting the rows of all the 
Ni (i.e., replacing each N by PN where P is an r x r 
permutation) leaves [Ni, N 2 , N3] unchanged. Also rescaling 
the rows of each iVj so that the scaling factors c* used for the 
n*, i — 1,2,3 satisfy cjcjc| = 1 (i.e., replacing each Ni by 
DiNi, where Di is diagonal and D\D 2 D 3 — I) also leaves 
[Ni, N 2 , N 3 ) unchanged. That under certain conditions these 
are the only changes leaving [N±, N%, N3] fixed is the essential 
content of Kruskal's theorem. 

To state the theorem formally requires one further definition. 
For a matrix N, the Kruskal rank of N will mean the largest 
number j such that every set of j rows of N are independent. 
Note that this concept would change if we replaced 'row' by 
'column,' but we will only use the row version in this paper. 
With the Kruskal rank of N denoted by ranks- N, observe 
that 

rankx N < rank N. 



Theorem 4: (Kruskal) Let ji — rank^ N{. 

h+h+h >2r + 2, 



If 



then \N\,N%,N^[ uniquely determines the N, up to simul- 
taneously permutating and rescaling the rows. That is, if 
[Ni,N 2 ,N 3 ] = [N[, N^Na], then there exists a permutation 
P and diagonal D u with DiD 2 D 3 = I, such that N[ = 

PDiNi. 

We will apply this result to identify parameters of a stochas- 
tic model with a hidden variable. In phylogenetic terms, the 
model is one on a 3-leaf tree, rooted at the central node. A 
hidden variable at the central node has r states, and observed 
variables at the leaves have k\,K2,K3 states respectively. 
Markov matrices Mi, of size r x K{, describe transitions 
from the state at the central node to those on leaf i, with 
observed variables conditionally independent given the state 
of the hidden variable. For each i = 1, 2, 3, let m* denote the 
jth row of Mi. One then checks that the joint distribution for 
such a model is given by 



[v;M 1 ,M 2 ,M 3 ] = > run 



3=1 



1 m~ 



m" 



Corollary 5: Suppose Mi, i = 1,2,3, are r x m Markov 
matrices, and v = (v\, . . . , v r ) is a row vector of non-zero 
numbers summing to 1. Let ji = rank#- Mj. If 



to permutation. That is, [v; Mi, M 2 , M 3 ] = [v'; M[ , M 2 , M 3 ] 
implies that there exists a permutation P such that M[ = PMi 
and v' = vP T . 

Proof: This follows from Kruskal's theorem in a straight- 
forward manner, using that the rows of each Markov matrix 
Mi sum to 1. ■ 

Remark 1: The corollary actually claims identifiability for 
generic parameters, where 'generic' is used in the sense of 
algebraic geometry. To see this, note that for any fixed choice 
of a positive integer ji, those matrices Mj whose Kruskal rank 
is strictly less than ji form an algebraic variety. This is because 
the matrices for which a specific set of ji rows are dependent 
is the zero set of all ji x ji minors obtained from those 
rows. Then, by taking appropriate products of these minors 
for different sets of rows we may obtain a set of polynomials 
whose zero set is precisely those matrices of Kruskal rank 
< ji- 

To apply the Corollary of Kruskal's theorem in a phyloge- 
netic setting, we need one additional definition. Given matrices 
Ni of size r x s and N 2 of size r x t, let 



N = Ni ® r 



N 2 



denote the r x st matrix that is obtained from row-wise tensor 
products. That is, the ith row of is the tensor product of 
the ith row of Ni and the ith row of N 2 . Although we do not 
need a specific ordering of the columns of N, we could, for 
instance, define N by N(i,j + s(k - 1)) = N 1 {i 1 j)N 2 (i, k). 

To interpret this row-wise tensor product in the context of 
models, consider a rooted tree with two leaves, and a Markov 
model with r states at the root, and m states at leaf i, i = 
1,2. Then the transition probabilities from states at the root 
to states at leaf i are specified by an r x matrix Mj of 
non-negative numbers whose rows add to 1 . The matrix M = 
Mi <S) row M 2 will also have non-negative entries, with rows 
summing to 1. Its entries give transition probabilities from 
the r states at the root to the kik 2 composite states at the 
leaves, formed by specifying the state at both leaves. Thus this 
row tensor operation is essentially what underlies the notion 
of a 'flattening' of a multidimensional tensor that plays an 
important role in [4], [2]. 

Kruskal's result will actually be applied to a model on a 
5-leaf tree, by a method we now indicate. For the 5-leaf tree 
shown in Figure [2] rooted at p, suppose Markov matrices Mi 
(not necessarily square) are associated to all edges to describe 
transition probabilities of states moving away from the root. 







. M 3 


M 6 , 




h+h+h >2r + 2, 



then [v; M\,M 2 , M 3 ] uniquely determines v, M\,M 2 , M 3 up 



Fig. 2. Viewing a model on a 5-leaf tree as a model on a 3-leaf tree. 
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Then with 

Mi = M 3 (Mi® row M 2 ), 
M 2 = M 6 (Mi ® row M 5 ), 
M 3 = M 7 , 

we obtain Markov matrices on a simpler 3-leaf tree rooted 
at its central node. Retaining as root distribution the root 
distribution v at p, the joint distribution for this simpler tree 
is [v; Mi, M 2 , M3], The entries of the distribution for the 5- 
leaf tree and the 3-leaf tree are of course the same, though 
one is organized as a 5-dimensional array and the other as 
a 3-dimensional array. However, the reorganization into a 3- 
dimensional array is crucial in allowing us to apply Kruskal's 
theorem. 

Lemma 6: On the 6-leaf tree of Figure [TJ rooted at p, 
consider a Markov model with r states at all internal nodes 
and re states at leaves. Let the state distribution at the root be 
specified by v, and Markov matrices Mi describe transitions 
on edge a directed away from the root, so for internal edges 
the Mi are r x r, and on pendant edges are r x k. 

Suppose in addition 

(1) all entries of both v and v' = vMg are positive, 

(2) the four matrices M 6 (M 4 ® row M 5 ), M g M 6 (M 4 ® row 
M 5 ), M 3 (Mi ® row M 2 ), and M^M 3 (Mi ® row M 2 ), 
where Mg = diag(v')^ 1 Alg diag(v), all have rank r. 

(3) the Kruskal ranks of M7 and Mg are > 2. 

Then Mg, Mr, and v are uniquely determined from the 
joint distribution, up to permutation. That is, from the joint 
distribution we may determine matrices Ng,Nr and a vector 
w with N 9 = PfM g P 2 , N 7 = Pi M 7 , and w = vP x for 
some unknown permutations Pi and P 2 . 

Proof: Note that since the matrices in (2) have rank r, 
which is equal to the number of their rows, they also have 
Kruskal rank r. 

First consider the 5-leaf subtree where edge e§ has been 
deleted, and edges eg and ee conjoined. Then by Corollary 
[5] we may determine vPi and the matrices P X T M 3 (Mi ® row 
M 2 ), P[ M 9 M 6 (M 4 ® row M 5 ), and PfM 7 for some unknown 
permutation Pi. 

Now reroot the tree of Figure [TJ at p' , using root distribution 
v' and matrix Mg on edge eg (directed oppositely), without 
affecting the joint distribution at the leaves. Having done 
this, consider the 5-leaf subtree where edge 7 has been 
deleted. Another application of the corollary determines v'P 2 , 
P^M 6 (Mi ® row M 5 ), P 2 T M^M 3 (Mi ® row M 2 ), and P 2 T M 8 . 

Finally, from the r x re 2 matrices A = Pf M 9 M§(M 4 ® row 
M 5 ) and B = P 2 T M 6 (M 4 ® row M 5 ), which by assumption 
have rank r, we may determine the r x r matrix C = 
Pi MgP 2 : since both A and B have rank r, the equation 
A = CB uniquely determines C. ■ 

Note that for the covarion models, v has positive entries 
by assumption, and since R is time reversible with stationary 
vector v, we will have v' = v and Mg = Mg. Thus condition 
(1) will automatically be satisfied in our application of the 
lemma. 

The only potential obstacle to applying Lemma [6] to the 



covarion model is that we must know that assumptions (2) and 
(3) on the ranks of various products of Markov matrices are 
met. While one would certainly suspect that at least for generic 
choices of covarion parameters there would be no problem, it 
is non-trivial to establish this rigorously. That is the content 
of the next lemma. 

Let {/1, ...,/„} be a finite collection of analytic functions 
with common domain DCC". Recall that the analytic variety 
V = V(fi, . . . , f n ) is the subset of D on which all /, vanish. 
In the next lemma we will use the existence of a single point 
in D \ V to conclude that the V is of strictly lower dimension 
than D. This step may not be familiar to most researchers in 
phylogenetics, so we recall a simpler instance. A powerful 
theorem concerning analytic functions of a single complex 
variable is that if an analytic function / is not identically 
zero, then any zeros of / in the interior of its domain must 
be isolated. Equivalently, if there is a single point zq with 
f(zo) ^ 0, then the zero set of / is a zero-dimensional subset 
of the one-dimensional domain of /. Our argument simply 
uses a generalization of this fact from the theory of functions 
of several complex variables. 

Lemma 7: Identify the stochastic parameter space S of any 
of the models Cov(c, re), eCov(c, re) or sCov(c, re) on the 6- 
taxon tree of Figure [TJ with a full-dimensional subset of IR L so 
that the parameterization map for the probability distribution 
is given by analytic functions. 

Let X C S be the subset on which either at least one of 
the four ere x re 2 matrices arising from cherries, 

exp (P£ 3 ) (exp (Pii ) J ® row exp(Rt 2 )J), 
exp(P(i 3 + i 9 ))(exp(Pii) J ® roto exp{Rt 2 )J), 

exp (Pt 6 ) (exp (Rti ) J ® row exp(Rt 5 )J), 
exp(P(t 6 +i 9 ))(exp(Pi 4 )J® roto exp(Rt 5 )J), 

has rank < ere, or at least one of the two matrices 

exp(Rtr)J, exp(Rt s )J 

on the pendant edges er, e% has Kruskal rank < 2. Then if 
c < re, the set X is a proper analytic subvariety of S, and 
hence of dimension < L. 

Proof: For our argument, it will be convenient to extend 
the set of allowable edge lengths from t{ > to a larger set 
including ti = 0. Once the claim is established allowing zero- 
length edges, we may restrict to positive-length edges (as is 
needed in other parts of our paper). This is simply because the 
original and extended parameter spaces described here have 
the same dimension, so the intersection of a proper analytic 
subvariety of the extended parameter space with the smaller 
parameter space must also be a proper subvariety. 

Consider first the edges ei,e 2 ,es,er in the tree of Figure 
[TJ In Section [VI] below it will be shown that when c < re 
there is at least one choice of a rate matrix R for sCov(c, re), 
and edge lengths ti > 0, t 2 = 0, t 3 = 0, tr > so 
that exp(Pi 3 )(exp(P£i) J ® row exp(Rt 2 )J) has rank ere and 
exp(Rtr)J has Kruskal rank > 2. Assuming this result for 
now, by in addition choosing 

tg = 0, is = tr, te = t 3 , is = t 2 , ti = t\ 



s 



we have found at least one parameter choice for sCov(c, k) 
that does not lie in X s c v 

Since the same R and {tj} arise from parameters for 
eCov(c, k), respectively Cov(c, k), we have also found at least 
one parameter choice for these models that does not lie in 
X cCov , respectively Xcov 

Now observe that the set of parameters for which any one 
of the four specified ck x k 2 matrices has rank < ck is the zero 
set of a collection of analytic functions. Such functions can 
be explicitly constructed by composing the parameterization 
map for each matrix with the polynomial functions expressing 
the ck x ck minors. Similarly, the set of parameters for which 
a pendant edge matrix fails to have Kruskal rank > 2 is the 
simultaneous zero set of a collection of analytic functions built 
from the composition of the parameterization of that matrix 
with the 2x2 minors. Thus the set X is the union of analytic 
varieties, and hence itself an analytic variety. This set cannot 
be the entire parameter space, since we have found one point 
that lies outside it. Therefore X is a proper analytic subvariety, 
as claimed. As such, it is of dimension strictly less than L. ■ 

For all covarion parameters outside the set X of Lemma [7] 
we may apply Lemma [6] and identify M = P^ cxp(Rtg)P2 
and v = fiPi for some unknown permutations Pi , P%. As 
X is of lower dimension than the parameter space, it has 
Lebesgue measure 0. Thus for generic covarion parameters 
we may identify M and v. 

VI. Construction of scaled covarion parameters 

WITH CERTAIN PROPERTIES 

In this section the particular parameter choice needed in the 
proof of Lemma [7] is constructed. We thus consider only the 
model sCov, with the parameters Q, S, and {r^} as described 
in Section HU and R as given by equation 10. We seek values 
of these parameters and of t\, > so that exp(itti) J® row J 
has rank ck and exp(Rtj)J has Kruskal rank at least 2. Note 
that since exp(Rti) J ® r< ™ J is ck x k 2 , it may only have the 
desired rank when c < k. 



One might first consider taking t\ = 0, so 



cxp(Rti)J i 



J = Ji 



J. 



However this ck x k 2 matrix has rank k < ck. Similarly, 
taking £7 = 0, so cxp(Rtr)J — J, fails to produce a matrix 
of Kruskal rank at least 2. Thus we must do more work to find 
the needed example. Our first step is to establish the following. 

Lemma 8: Suppose that for each j 6 [n], the vectors 
appearing as the jth rows of the matrix powers Q rn , m = 
1, ...,c— 1 are independent. Then there exist t\,t>j > 
such that exp(i?ii) J ® rotu J has rank ck and exp(Rt7)J has 
Kruskal rank at least 2. 

Proof: We first show the existence of such a t\. Let 
M = M(t) = cxp(Rt)J. Because of the specific form of J, 
it is easy to see that any dependency relationship between the 
rows of M ® row J is equivalent to k separate dependency 
relationships between rows of M. Specifically, the rows of 
M ® row J are independent if, and only if, for each j S [k] 
the set of the c rows of M with index (j, j), i £ [c], are 
independent. 



Letting Xj (t) denote the c x k submatrix of M (t) consisting 
of the («, j) rows, we claim that some cx c minor of Xj(t) is 
non-zero for all but a discrete set of values of t. Since there 
are only finitely many j to consider, this implies the existence 
of the desired t±. 



Fixing j, for notational ease let 
/xi(*) N 



Xj(t) = 



x(t) = det 



/xi(i) N 



\*c(*)/ 



where the bar denotes projection onto some choice of c 
coordinates, to be specified later, so that x(t) is a specific 
c x c minor of Xj(t). 

Since x(t) is an analytic function, to establish that it is non- 
zero except at a discrete set of points, it is enough to show 
it is not identically zero. Now x(t) is easily evaluated only at 
t = 0, and unfortunately x(0) = since Xi(0) is the standard 
basis vector for all i. We will, however, show x(t) is not 
identically zero by showing the derivative x^ n '(0) is non-zero 
for n = c(c- l)/2. 

To obtain information on the derivatives x^(0), observe 
that M(t) is the solution to the initial value problem M' = 
RM, M(0) = J. Thus xf\0) is the row of R l J. 

Moreover, since Sl^ = 0, 



R l J = (diag(ri,r 2 , 



>Q + S® I K ) l (ll ®7 K ) 



= diag(ri, r 2 , . . . , r c ) l l T c ® Q l + ^ y^ m ® Q m , 

m—l 

for some vectors yz. m . Thus, for / > 1, x^(0) is a linear 
combination of the jth rows of Q m , 1 < m < I, where the 
jth row of Q l appears with coefficient r\. 

Now with n = c(c — l)/2, 



xW(0) - ™Adet (x^O), . . . ,xW(0)) , 



E 

A— (ni ,...,n c ) 



(3) 

where the summation is over non-negative integer solutions 
to ni + • • • + n c = n and m\ = ( n J is a multinomial 
coefficient. Letting zo = and %i be the jth row of Q % for 
i > 1, we have shown that ^(0) lies in the span of {zi}'_g 
for all I > 0. This implies that any summand in equation (O 
must vanish if more than Z + 1 of the rii satisfy n,; < I, since 
in that case the rows in the determinant are dependent. But 
n = c(c — l)/2 = + 1 + • • • + (e — 1), hence non-zero terms 
can arise only when A is a permutation of (0,1, ... ,c— 1). 

With S c denoting the permutations of (1, . . . , c), and m = 

W(o,l,..., c -l), 



(0) = m^ dct 



(xr 1(i) - i} (o), . 

sgn( M )det (x^O),...^^ 



(0) 
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But with Z 



,%c-i) T > we have shown 



(c-l) 



(Q)y 



where i M is a c x c lower triangular matrix with diagonal 



entries L^i = By hypothesis, all rows of Z except the 

first form an independent set, and since Q l l^ = for I > 
1 while zqI^ = 1, the first row is not in the span of the 
others. Thus Z has rank c, and some choice of c of its columns 
are independent. Specifying that the bar over a matrix or row 
vector designates a projection onto these column coordinates 
yields 



p(c) 



Lp,Z , 



so 



det 



(*!>>■■ 



fj,(c) 



(0) = 



dct(Z). 



Since det(Z) ^ 0, to see that x (n) (0) 7^ it is enough to 
show 

fJ,£S c 1=1 

But the left hand side is a Vandermonde determinant, and since 
the ri are distinct, it does not vanish. Thus the desired t\ exists. 

For the existence of £7, consider the (i\,jx) and (12, j'2) 
rows of exp(Rt)J. If ji ^ 32, then these rows are independent 
when t = 0, hence for all t except a discrete set. If j% = J2, 
then the two rows are rows of Xj 1 (t), and thus independent 
for all but a discrete set of t by our work above. Since there 
are only finitely many pairs to consider, for all but a discrete 
set of values we find exp(Pt) J has Kruskal rank > 2. ■ 

The existence of rate matrices Q satisfying the hypotheses 
of the last lemma is a consequence of the following one. 

Lemma 9: Suppose a k x k rate matrix Q has at least c 
distinct eigenvalues and its right eigenvectors can be chosen 
to have all non-zero entries. Then for each j G [k] the 
vectors appearing as the jth rows of Q l , I = 0, . . . , c — 1, 
are independent. 

Proof: Let Q = UDU^ 1 be a diagonalization of Q. 
Then with Uj denoting the jth row of U, the jth row of Q l is 
UjD l U^ 1 . To show these rows are independent, it is enough 
to show the UjD 1 , I = 0, . . . , c — 1 are independent, or even 
that the projections of these vectors onto some choice of c 
coordinates are independent. By choosing to project onto c 
coordinates corresponding to distinct diagonal entries of D, 
we may reduce to the case where D is c x c with distinct 
diagonal entries and the vectors Uj G C c have all non-zero 
entries. 

But if W is the c x c matrix whose lt\\ row is UjD 1 , then 
W — Fdiag(uj) where V is a Vandermonde matrix con- 
structed from the diagonal entries of D. By our assumptions, 
both V and diag(uj) have non-zero determinants, so W does 



as well. Thus the rows of W are independent. ■ 
To see a Q satisfying the hypotheses of Lemma [9] exists, let 
1 



be a generalized Jukes-Cantor matrix of size k, all of whose 
off-diagonal entries are equal, which has stationary vector 
1 K . The eigenspaces of Qq are the span of 1 K and its 
orthogonal complement. For a diagonalization Qq = UDqU^ 1 
we can thus chose U to be an orthogonal matrix all of whose 
entries are non-zero. (For instance, when k = 4 we may 
choose U to be a Hadamard matrix.) Since Do has repeated 
diagonal entries, perturb the non-zero entries slightly to obtain 
a diagonal matrix D without repetitions, and let Q = UDU -1 . 
Since Q also has 1 K as its stationary distribution, and since 
Q is symmetric, it is a rate matrix of the sort needed. 

Choosing such a Q and any S and distinct ri for the 
sCov parameters gives a particular choice of scaled covarion 
parameters Q, S, {r{\ such that there exists a t\ > where 
exp(Rti)J(g)J has rank ck, and a > such that exp(Rtr)J 
has Kruskal rank at least 2. 

Thus Lemma [7] is fully established. 

VII. Identifying the covarion rate matrix R 

The next goal is to use v = fiPi and M = P^f exp(Rtg)P2, 
as identified in Section [V] through Lemmas [6] and [7] to 
determine the covarion root distribution /x and the covarion 
rate matrix R. It is of course enough to determine Rt$, 
where tg > is the edge length, and then use the required 
normalization of R. 

Let us assume v has its entries in non-increasing order. 
(This can be achieved by multiplying v on the right by 
some permutation P, and M on the left by P T , thereby 
changing the unknown Pi.) Now since diag(/x) exp(Rt) is 
symmetric, and diag(V) = P^f diag(/x)Pi, one can verify that 
diag(i>')AfP^ r Pi is symmetric as well. This shows there is 
at least one reordering of the columns of M that results in 
diag(i>')A/ being symmetric. Assume some such ordering of 
the columns of M has been chosen to ensure this symmetry. 

If v (equivalently, fi) has no repeated entries, these choices 
have uniquely determined an ordering to the rows and columns 
of M, and forced P2 = P\. To see this, note the rows of M 
have a fixed correspondence to entries of u, which have a 
unique decreasing ordering. For the columns, note that the 
symmetry of diag(i/)M and the fact that 1 CK M T = 1 CK 
implies uM — v. However, if the columns of M are permuted 
by P, then vMP = vP ^ v. We therefore can conclude 
v = /jlPi and M — P X T exp(Ptg)Pi for some unknown 
permutation Pi. 

Since v may have repeated entries, the above argument only 
holds for generic choices of parameters. In order to avoid 
introducing any generic conditions other than those already 
arising from the application of Kruskal's theorem, we give an 
alternate argument using the following lemma. 

Lemma 10: Suppose that a matrix M has a factorization of 
the form M = PW T ZW for some real symmetric positive- 
definite m x m matrix Z, real m x n matrix W of rank n, 
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and n x n permutation P. Then P is uniquely determined by 
M. 

Proof: The matrix Z defines an inner product on R m , 
and if Wi denotes the ith column of W, then the i,j entry of 
the symmetric matrix N = W T ZW is 

(Wi,Wj) Z = wfZWj. 

But for any inner product, if x / y then 

(x,x) + (y,y) > 2(x,y). 

Now the matrix W has distinct columns since it has rank n. 
Thus the entries of N satisfy 

nu + rijj > 2riij. (4) 

Suppose for some permutations Pi , P2 the matrices N\ — 
Pi M and N2 = P-fM are both symmetric, and have entries 
satisfying the inequalities ©. Note also that N\ and N2 have 
the same set of rows. 

Consider first the largest entry (or entries, in case of ties) 
of N% and N2. Because the inequality in (0 is strict, a largest 
entry cannot appear off the diagonal. Thus the row (or rows) of 
TVi and N2 containing the largest entry (or entries) must occur 
in the same positions. Since the same argument applies to the 
submatrices obtained from the Ni by deleting the rows and 
columns with the largest entries, repeated application shows 
Ni = N 2 . Thus Pi = P 2 . M 

Corollary 11: Suppose u, M are of the form 

v = juPi, M = Pi exp(Rt)P 2 , 

for some covarion rate matrix R with stationary vector fj,, 
permutations Pi,P2, and scalar t. Then P^ P2 is uniquely 
determined. 

Proof: Apply Lemma [TUl to diag(i>)Af, with P = P 2 , 
W = P 2 , and Z = diag(/x) exp(itt). ■ 

As a consequence of this corollary, after multiplying M on 
the right by (P^ P2) T we may now assume we have 

v = fiP, M = P T exp(i?i)P 

for some (unknown) permutation P. But then M = 
exp{P T RPt), and since this matrix is diagonalizable with 
positive eigenvalues, P T RPt is determined by applying the 
logarithm to its diagonalization. 

Now P T RPt is simply a rescaled version of R with the 
same permutation applied to rows and columns. Thus there 
exists at least one simultaneous permutation of the rows and 
columns of P T RPt which yields a rescaled covarion rate 
matrix. However, we do not yet know if there is a unique 
such permutation, or a unique such covarion rate matrix. 

One might suspect that the pattern of zero entries in the 
off-diagonal blocks of a covarion rate matrix should allow the 
(almost) unique determination of Rt from this permuted form. 
This is the content of the following lemma. 

Lemma 12: Let Ri, R 2 be rate matrices for Cov(c, k), with 
Ri class irreducible, as defined in Section [XT] Suppose for 
permutations Pi,P2, and scalars t\,ti > 0, that 

PfRiPih = PjRiP^. 



If c ^ k then ti = t 2 , and P = P1P2 can be expressed 
as P = P <g) P for some c x c permutation P and n x k 
permutation P. Thus Ri can be determined up to application 
of a permutation of the form P ® P. 

If Ri,R 2 are rate matrices for either sCov(c, k) or 
eCov(c, k), then the same result holds for all c. 

Note that a permutation of the form P®P can be viewed as 
a permutation of classes by P, and a simultaneous permutation 
of bases within all classes by P. 

Proof: Using the normalization of Ri and R 2 , it is trivial 
to see that ti = t2- Conjugating by P2, we obtain P T RiP = 
R 2 . 

Let N be a matrix of the same size as Ri, with entry 1 
(respectively, 0) wherever the corresponding entry of i?i is 
positive (respectively non-positive). Let Gi = G(Ri) be the 
(undirected) graph whose adjacency matrix is N — N T . Thus 
the vertices of Gi are labeled by the elements of [c] x [k], 
the indices corresponding to rows and columns of R\, and an 
edge joins vertices i and j exactly when Ri(i,j) > (or, 
equivalently, when Ri(j,i) > 0). Gi is the 'communication 
graph' of Ri, expressing which instantaneous state changes 
can occur. 

By assumptions on Ri, for each class i with Qi ^ 0, the 
vertices labeled j G [k], corresponding to all states in 

class i, form a clique (i.e., the subgraph on these vertices is 
a complete graph) of size k. Moreover, these cliques are each 
maximal, since any vertex outside of the clique has 

i' ^ i and is connected to at most one vertex in the clique, 
namely which has the same base but different class. 

Suppose first that c ^ k. In this case we show there are no 
other maximal cliques of size k. To this end, suppose a vertex 
labeled (i, j) is in some other maximal clique C of size k. The 
only vertices adjacent to it outside of its class correspond to 
the same base j. Thus C must contain at least one of these, say 
(k,j) where k ^ i. As the (k,j) vertex and any vertex 
cannot be in a common clique if j =fc I, C must contain only 
vertices corresponding to base j. As there are c / n of these, 
they cannot form a clique of size n. 

Now if we similarly construct G2 — G(R 2 ), the statement 
P T RiP — R2 means there is a graph isomorphism from 
Gi to G2, obtained by relabeling vertices according to the 
permutation P. As such an isomorphism must take maximal 
cliques to maximal cliques, we see that P must map all states 
in an i?i class with Qi ^ to all states in an R 2 class with 
Qj ^ 0. (As the covarion model allows at most one class 
with Qi = 0, this also means that if either Ri has a class with 
Qi = 0, then so does the other, and these classes must also 
be mapped to one another.) 

This implies P has the following structure: Partition P into 
a c x c matrix ofcxc blocks, corresponding to classes. All 
blocks of P are zero, except for one block in each row and 
column. Let P be the c x c permutation matrix with Is in 
positions corresponding to those non-zero blocks. The non- 
zero blocks of P are also n x k permutation matrices. 

We next claim that the non-zero k x k blocks in P are 
all identical. To see this, consider how P acts on a non-zero 
off-diagonal block Si x i 2 of Ri through the formula P T RiP: 
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the resulting block has the form P X T Si 1 i 2 P2 where P\ and P2 
are two of the k x k permutations appearing as blocks of P. 
But this must equal the corresponding block of R2, which is 
diagonal. Thus if all diagonal entries of S^i 2 are non-zero 
then Pi P2 = I K < so Pi = P2. The class irreducibility of Ri 
ensures that we obtain enough such equalities to see that all 
Pi are equal to some common k x k permutation P. Thus 
P = P ® P. 

Now for the models sCov and eCov consider the case of c = 
k. In this case, maximal cliques corresponding to either a fixed 
base or a fixed class have the same cardinality, but there can 
be no other maximal cliques. Unless the graph isomorphism 
from Gi to G2 maps some fixed-base clique to a fixed-class 
clique, our earlier argument applies. 

We therefore suppose that the base j clique is mapped to the 
class i clique, and argue toward a contradiction. This means P 
maps vertices in Gi labeled (k, j) for k = 1, . . . , c to vertices 
labeled (i, I) for I = 1, . . . , k in G-2- As a result, every other 
fixed-base clique in Gi must also map to a fixed-class clique 
in G2, since all the fixed-base cliques of G2 include some 

But the formula P T R%P = R2 implies that each diagonal 
block of i?2 must have as its k 2 — n off-diagonal entries the 
k 2 — k values Si ± i 2 ^ which appear in the off-diagonal blocks 
of R%. But this is impossible, since the base-change matrices 
Qi of i?2 are assumed not to be equal. ■ 

We now have determined R and up to separate permu- 
tations P of the bases and P of the classes. The ambiguity 
expressed by P cannot be removed, as permuting classes has 
no effect on the distributions defined by the model. Our next 
step is to use information on the ordering of the bases obtained 
at the leaves of the tree in order to determine P. 

Let P T Mj denote the c/tx n matrix, which was determined 
via Lemma [6] describing permuted transition probabilities on 
edge e 7 of the tree of Figure Q] Assuming P = P <E> P by 
previous steps in our analysis, (P(g>P) T exp(Pt 7 ) J is known. 

Lemma 13: Suppose W = P T exp(Rt)J for some permu- 
tation P = P ® P, covarion rate matrix R, and scalar t. Then 
P is uniquely determined. 

Proof: Consider the k x k matrix, determined by known 
information, 

J T di&g{v)W = J T P T diagO)PP T cxp(Pi 7 )J 

= (l c ®/ K )(P T ®P T )diag(/x)exp(Pi 7 )J 
= (1 C P T ® I K P T ) diag(/i) exp(Pi 7 ) J 
= (l c <g> P T ) diagO) exp(Pt 7 ) J 
= P T (1 C ® I K ) diag(/x) exp(Pt 7 ) J 
= P T N, 

where N = J T diag(/x) exp(Pi 7 )J. From Lemma[3] we also 
have that 

N = K T exp(Bt 7 )K 

where B is real diagonal and K has rank k. We may thus 
apply Lemma [10] to the product 

J T &i&g{v)W = P T K T exp(Bt7)K 



to determine P. ■ 
Thus for generic parameters, R and fi are determined 

uniquely, up to the permutation P of classes. 

Remark 2: That the restriction c < k is necessary for the 

Cov model in Lemma [12] can be easily seen. For example, 

with k = c — 2, the two rate matrices 
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are related by exchanging rates and classes. Note further that 
1 1 as their stationary distribution, so 



both R and R' have ±1 4 



they lead to the same observed distribution at a single leaf. 
Moreover, they lead to the same set of observable distributions 
at two leaves when one considers all possible edge lengths 
t > 0. Thus one cannot use the observed distribution at one 
or two leaves to distinguish between distributions arising from 
these two rate matrices. 

Of course one might next attempt to use observed joint dis- 
tributions at multiple leaves to distinguish these parameters, or 
introduce additional generic conditions to obtain identifiability 
of numerical Cov parameters even when c = k. As we have 
not pursued these directions, we do not claim identifiability 
fails for generic parameters in this case, but only that the 
arguments given above do not establish it. 

VIII. Identifying edge lengths 

As R is now known, all that remains is to determine edge 
lengths. By simple and well-known arguments [18], these can 
be determined from knowing total distances between leaves 
of the tree. Thus the determination of all edge lengths is 
established by the following. 

Lemma 14: Fix a covarion rate matrix R, of size ck x ck. 
Suppose a k x k matrix N is in the image of the resulting 
covarion model on a 2-taxon tree, with edge length t. Then 
N uniquely determines t. 

Proof: From Lemma [3] we have that 

N = K T exp{Bt)K, 

where B = diag(/?i, . . . , (3 CK ), = (i x > [3 2 > ■ ■ ■ > /3 CK and 
K is a real ck x k matrix, of rank k. Furthermore, since R is 
known, so are all and K. 

With K = (kji) and N = (riij), this implies the diagonal 
entries of N are 



THi = 2_\ fc?,exp(/9jt). 



3=1 



(5) 



As the kji are real numbers and all are non-positive, each 
term in this formula is a non-increasing function of t. Thus 
na — nu{t) is a non-increasing function of t. If we show 
that for some i the function nu(t) is strictly decreasing, then 
from any value of nu we may determine t. But to establish 



that some n« is strictly decreasing, we need only show there 
exists some i and some j > 1 such that kji ^ 0, so that at 
least one term in equation (0 is a strictly decreasing function. 
However, as K has rank k > 1, we cannot have fejj = for 
all j > 1. ■ 
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